DOCUMENT RESUME

ED 273 654                                                  TM 860 503

AUTHOR        Sarvela, Paul D.
TITLE         Discrimination Indices Commonly Used in Military
              Training Environments: Effects of Departures from
              Normal Distributions.
PUB DATE      Apr 86
NOTE          36p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (67th, San
              Francisco, CA, April 16-20, 1986).
PUB TYPE      Speeches/Conference Papers (150) -- Reports -
              Research/Technical (143)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   Comparative Analysis; *Criterion Referenced Tests;
              *Item Analysis; *Mastery Tests; *Military Training;
              Postsecondary Education; Raw Scores; Scores;
              Simulation; Statistical Analysis; *Statistical
              Distributions; Statistical Studies; Testing Problems;
              Test Items; Test Theory
IDENTIFIERS   *Discrimination Indices; *Item Discrimination (Tests)

ABSTRACT

Four discrimination indices were compared, using
score distributions which were normal, bimodal, and negatively
skewed. The score distributions were systematically varied to
represent the common circumstances of a military training situation
using criterion-referenced mastery tests. Three 20-item tests were
administered to 110 simulated subjects. The cutting score on each
test was 10 items correct. Three databases were constructed for
normal, bimodal, and skewed score distributions. Five item analysis
statistics were calculated: the p statistic, two versions of the
upper-lower group statistics, the phi coefficient, and the
point-biserial correlation. Analysis of variance and t-tests were
used to estimate differences between the discrimination index values.
With normal data, the second upper-lower statistic produced the
largest discrimination values, point-biserial next, and phi
coefficient and the first upper-lower produced identical, least
discriminating values. Similar results were obtained for the bimodal
discrimination indices. The skewed distribution analysis was slightly
different, with the first upper-lower results larger than the phi
coefficients. The second upper-lower method was not significantly
different from the point-biserial correlation. (Suggestions for
choosing a method are summarized in a matrix and a decision tree).
(GDC)



* Reproductions supplied by EDRS are the best that can be made *
* from the original document. *



Discrimination Indices Commonly Used in Military Training Environments:
Effects of Departures from Normal Distributions

Paul D. Sarvela, Ph.D.

Ford Aerospace and Communications Corporation

Western Development Laboratories Division




7100 Standard Drive
Hanover, MD 21076

Paper Presented at 
The Annual Meeting of the American Educational Research Association 

San Francisco, 1986. 



* Paul Sarvela is now with the Dept. of Health Education, College of
Education, Southern Illinois University, Carbondale.



Abstract 



Military Test Analysis 
2 



The unique nature of testing in military training environments (e.g.,
criterion-referenced testing, bimodal and skewed distributions of test scores)
creates special problems in the selection of discrimination statistics used to
evaluate test items. This paper describes the findings of a study which
compared the results obtained from four discrimination indices when test score
distributions were systematically varied (normal, bimodal, and negatively
skewed) to represent common military test score distributions. A summary
matrix is presented outlining the advantages and disadvantages of each
statistic. In addition, a flow chart is included to help test evaluators
make decisions concerning the selection of a discrimination index. The paper
concludes with a discussion concerning the practical benefits of each
statistic, as well as their relative costs and "ease of use" to the
statistically unsophisticated test evaluator.




Discrimination Indices Commonly Used in Military Training Environments: 
Effects of Departures from Normal Distributions 

Introduction 

The statistical procedures commonly called item analysis are one way 
measurement specialists appraise and improve the quality of tests. Of the 
many forms of item analysis, item difficulty and discrimination indices are 
most often computed. These statistics provide valuable information concerning 
the difficulty of the test, as well as the degree to which the test items 
differentiate between varying achievement levels of students. These data can 
then be used to increase the reliability of the test (Guilford, 1954). 

Although psychometricians working in military training environments
recognize the importance of item analysis, they often use criterion-referenced
tests (CRTs) to measure student achievement, which frequently produce score
distributions that are either bimodal or negatively skewed. Consequently,
they work with test score frequency distributions which violate the assumption
of normality, an assumption commonly held by many of the "classical" item
analysis statistics, such as the upper-lower indices and the point-biserial
correlation coefficient. In addition, the small sample size (N = 15 or less)
and variability in student achievement found in many military training
pilot-study scenarios preclude the application of more sophisticated item
analysis strategies, such as the Rasch technique.¹

¹ See Haladyna and Roid (1979) for a discussion concerning Rasch analysis
and CRT.

Several researchers (e.g., Berk, 1984; Popham, 1981; Roid &
Haladyna, 1982) note that several new CRT-specific item analysis techniques,
or "instructional-sensitivity" indices, as described by Haladyna and Roid
(1981), have been proposed to address the problems associated with item
analysis in CRT testing situations. For example, the pre-to-post difference
index (PPDI) introduced by Cox and Vargas (1966) and the percentage of
possible gain (PPG) developed by Brennan and Stolurow (1971) were both
developed to produce item sensitivity indices more appropriate for
criterion-referenced testing situations. These indices are excellent methods
for obtaining information concerning the quality of CRT items. Unfortunately,
they require the test to be administered to students before and after
instruction.² Often, the military test designer does not have this luxury.

² Popham (1981) notes that an additional disadvantage to these indices is
that the pretests might be reactive, and therefore sensitize students to
certain items on the pretest.

Another CRT approach uses two different groups of students, one group
exposed to instruction, while the other group serves as the control (Ellis &
Wulfeck, 1982; Popham, 1981). P values are calculated for each group, and the
p results from the uninstructed group are subtracted from the instructed group
p results, resulting in the discrimination index D. Although this
strategy provides solid information concerning the instructional sensitivity
of the items, its major disadvantage is that two groups of students are
needed, a requirement not always easy to meet in a military testing
environment. In addition, the two groups tested must be identical with the
exception of treatment; otherwise, variance in the difficulty indices might be
attributed to confounding factors outside of the instruction (e.g., one group
might be inherently more intelligent than the other group). The problem of
randomly assigning students to treatment and control groups might be beyond
the control of the test specialist, making this method difficult to implement
as well.

It is often difficult, if not impossible, for the military test
specialist to obtain pretest scores, or randomly assign two groups of
examinees to instruction or control groups. Therefore, the test
designer/evaluator is faced with the problem of maximizing the amount of data
concerning test quality that can be gathered from one administration of the
test.

One strategy which appears to share the characteristics of both
classical techniques and the CRT item sensitivity measures is the Brennan
index (Brennan, 1972). This index (BI) is implemented by setting a cut score
for mastery on the test, and then dividing the test results into two groups
(masters and nonmasters). To obtain BI, the difficulty indices for the
nonmasters are subtracted from the indices for the masters (by item). This
method is conceptually similar to the upper and lower groups comparison used
in classical item analysis (see, for example, Kelley, 1939). The two methods
differ in interpretation, however, since one cannot be certain that those in
the upper group are truly masters, while those in the lower group are
nonmasters. (It should be noted that this same criticism can be applied to
Brennan's technique if the cut score is determined capriciously rather
than in a systematic and logical manner.)

When clear-cut mastery or non-mastery cannot be determined (or is to
be determined later in the test development process by comparing student
performance in the field with their test scores), test specialists must rely on
the traditional discrimination indices, despite less than optimum data
analysis conditions. Although these statistics and their use have been
described in detail by earlier researchers (e.g., Cureton, 1957; Ebel, 1954;
Englehart, 1965; Johnson, 1951), the effects of violations of the
assumption of normality, which commonly occur in CRT training environments,
must be studied in more detail. The purpose of this paper is to describe the
findings of a study which compared the results obtained from four different
"classical" discrimination indices (two versions of the upper-lower index,
r_pb, and phi) when test score distributions were systematically varied
(normal, bimodal, and negatively skewed) to represent test scores frequently
occurring in military testing situations. The paper discusses the practical
benefits of each statistic, as well as their "ease of use" to the
statistically unsophisticated test evaluator.

Method 

Sample and Instrumentation. A set of 110 simulated subjects (Ss) was
created to represent students enrolled in a military training program. The Ss
were "administered" three 20-item tests, scored in a dichotomous manner, with
one point assigned to a correct response, and 0 assigned to an incorrect
answer. The KR-21 internal consistency reliability index for the normal
distribution test was 0.77. The bimodal distribution test KR-21 was 0.87, and
the KR-21 coefficient for the skewed test was 0.78. The cut score on each of
the tests was set at 10 points.

Procedures. Three data bases (normal, bimodal, and skewed) were
constructed by varying the distributions of the three sets of simulated test
scores. The frequencies of items correct for each S were determined first,
dependent on the desired shape of each data base. Next, the item(s) each S
answered correctly (1-20) were randomly selected. This randomization produced
mean p values of 0.52 for both the normal and bimodal curves. As expected,
the mean p for the skewed distribution was higher (0.74), because more
subjects were assigned higher test scores.³
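The two-step construction described above (fix each subject's total score, then draw at random which items that subject answers correctly) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the seed, and the ten example scores are hypothetical and are not the study's data or program.

```python
import random

N_ITEMS = 20  # each simulated test had 20 dichotomously scored items


def simulate_responses(total_scores, n_items=N_ITEMS, seed=0):
    """Return a 0/1 response matrix (one row per subject, one column per item).

    Each subject's total score is fixed first; the items answered correctly
    are then drawn at random, following the procedure described in the text.
    """
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    matrix = []
    for score in total_scores:
        correct_items = set(rng.sample(range(n_items), score))
        matrix.append([1 if item in correct_items else 0
                       for item in range(n_items)])
    return matrix


def item_p_values(matrix):
    """Proportion of subjects answering each item correctly."""
    n_subjects = len(matrix)
    return [sum(row[item] for row in matrix) / n_subjects
            for item in range(len(matrix[0]))]


# Hypothetical total scores for ten subjects (not the study's data).
scores = [4, 7, 9, 10, 10, 11, 12, 14, 16, 18]
responses = simulate_responses(scores)
p_values = item_p_values(responses)
mean_p = sum(p_values) / len(p_values)
print(round(mean_p, 3))
```

Because every row keeps its assigned total, the mean p value always equals the mean test score divided by the number of items, which is why a mean score near 10.5 on 20 items gives a mean p near 0.52 and a mean score of 14.9 gives a mean p near 0.74.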

The normal curve test score distribution was designed to represent
the "control" against which the statistics could be compared, since the majority
of psychometric measures commonly used by evaluators require the criterion
score variables to be normally distributed. In addition, it represented a
common frequency distribution for achievement or aptitude tests used in
military settings. In terms of the descriptive statistical properties of the
normal distribution data base, the mean was 10.5, with a standard deviation of
4.29. There was a 0.0 value for the skewness coefficient.

The second data base constructed was bimodally distributed. This
form of a score distribution is often found in testing situations where there
are a group of masters and nonmasters. It also occurs in situations where one
group of students receives instruction, while another group does not. This
method of studying test items is recommended by Ellis and Wulfeck (1982) in
their Handbook for Testing in Navy Schools. The mean score for the bimodal
simulation was 10.25. The standard deviation was 5.38, while the skewness
coefficient was 0.02.

The third data set (skewed distribution) represented a mastery
learning situation. The negatively skewed distribution is commonly found in
military environments, where a majority of the students pass the test. These
simulation data had a mean of 14.9, and a standard deviation of 3.86. The
coefficient of skewness was found to be -0.84, indicating a moderately
negatively skewed distribution of the test scores.
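The KR-21 coefficients reported under Sample and Instrumentation can be checked against the means and standard deviations given for the three data bases. The sketch below assumes the standard KR-21 formula with k = 20 items; the paper does not print the formula it used, so this is a consistency check rather than the author's computation, and the function name is hypothetical.

```python
def kr21(n_items, mean, sd):
    """Kuder-Richardson formula 21 from test length, mean, and SD.

    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * s^2))
    """
    variance = sd ** 2
    return (n_items / (n_items - 1)) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )


# Means and standard deviations as stated in the text for each data base.
reported = {
    "normal": (10.5, 4.29),
    "bimodal": (10.25, 5.38),
    "skewed": (14.9, 3.86),
}

for name, (mean, sd) in reported.items():
    print(name, round(kr21(20, mean, sd), 2))
```

Under these assumptions the three values round to 0.77, 0.87, and 0.78, matching the reliabilities reported for the normal, bimodal, and skewed tests.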



³ A 0.52 p value was ideal for the simulation since most authorities
recommend a 0.50 p value to study item characteristics (e.g., Kelley, 1939).



A summary of the statistics describing each data base (normal, bimodal,
skewed) appears as Table 1.

insert Table 1 about here



Statistical Analyses. Five item analysis statistics were calculated
in this study: the p statistic, two versions of the upper-lower group
statistics (D1 and D2), the phi coefficient, and the point-biserial
correlation (r_pb).

The difficulty index, p, was calculated using the standard formula
appearing as equation one:

$$p = \frac{\text{number of correct item responses}}{\text{total number of item responses}} \tag{1}$$



D1 was obtained by separating those Ss who mastered the learning
(masters) from those who failed the test (nonmasters), as suggested by Brennan
(1972).⁴ (A similar strategy is also used when groups of instructed and
uninstructed Ss are available for studying test item characteristics. In this
case, one simply substitutes those Ss who received instruction for the
masters, and those Ss who did not receive instruction for nonmasters.) In the
cases of the normally and bimodally distributed test scores, this strategy was
also equivalent to the upper and lower half strategy, because in this study
the mastery test score was also the median score for the two data bases. A p
value was calculated for each group, and the resulting proportions were
subtracted from each other. This statistic is shown as equation two:

$$D1 = \frac{MC}{M} - \frac{NC}{N} \tag{2}$$

where:

MC = number of masters who answered the item correctly
M = total number of masters
NC = number of nonmasters who answered the item correctly
N = total number of nonmasters

⁴ Mastery was determined by being assigned a test score of 10 or greater.
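Equations one and two translate directly into a few lines of code. The sketch below is illustrative only: the function names and the ten-subject example are hypothetical, and the cut score of 10 follows the mastery rule stated for this study.

```python
def difficulty(item_scores):
    """Equation (1): proportion of correct responses to one item."""
    return sum(item_scores) / len(item_scores)


def d1(item_scores, total_scores, cut_score=10):
    """Equation (2): p for masters minus p for nonmasters on one item.

    A subject is treated as a master when the total score is at or above
    the cut score.
    """
    masters = [x for x, y in zip(item_scores, total_scores) if y >= cut_score]
    nonmasters = [x for x, y in zip(item_scores, total_scores) if y < cut_score]
    if not masters or not nonmasters:
        raise ValueError("both a master and a nonmaster group are required")
    return sum(masters) / len(masters) - sum(nonmasters) / len(nonmasters)


# Hypothetical data for one item: 1 = correct, 0 = incorrect.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [18, 16, 14, 12, 10, 9, 8, 6, 5, 3]

print(difficulty(item))          # overall item difficulty
print(round(d1(item, totals), 2))  # masters' p minus nonmasters' p
```

In this example four of the five masters and one of the five nonmasters answer correctly, so p = 0.5 and D1 = 0.8 - 0.2 = 0.6.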

D2 was calculated in a manner similar to D1; however, only the upper
and lower 27% of test scores were used for the comparisons. An early study by
Kelley (1939) demonstrated that this strategy was the most desirable method for
studying the effectiveness of items. The method for obtaining D2 appears as
equation three:

$$D2 = \frac{UC}{U} - \frac{LC}{L} \tag{3}$$

where:

UC = number of Ss in the upper 27% answering correctly
U = total number of Ss in the upper 27%
LC = number of Ss in the lower 27% answering correctly
L = total number of Ss in the lower 27%
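A minimal sketch of equation three follows. The group size is taken here as 27% of the sample rounded to the nearest whole subject, and tied total scores at the group boundary are split by their order in the data; both are assumptions of this sketch, since the paper does not say how boundaries and ties were handled. The function name and example data are hypothetical.

```python
def d2(item_scores, total_scores, fraction=0.27):
    """Equation (3): p in the upper 27% minus p in the lower 27% on one item."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))  # assumption: nearest whole subject
    # Indices of subjects ordered from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower = order[:group_size]
    upper = order[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical data for one item and eleven subjects.
item = [1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
totals = [20, 18, 17, 15, 13, 11, 10, 8, 6, 4, 2]

print(round(d2(item, totals), 2))
```

With eleven subjects each tail holds three subjects, giving 2/3 - 1/3 for this item. With the 110 simulated subjects of this study the same rule would place 30 subjects in each tail.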



Both D1 and D2 have two major assumptions associated with their use:
(1) a normal distribution of criterion scores
(2) equality of mean standard errors of measurement
in the upper and lower groups
See Cureton (1957) for a discussion concerning these two assumptions.

The phi coefficient was the third discrimination index used to
evaluate the data. In terms of the statistical assumptions associated with
the use of the phi coefficient, phi "can be used in any situation in which a
measure of the association between two dichotomous variables is desired"
(Allen & Yen, 1979, p. 37). In this study, the variables were dichotomized by
comparing frequencies on each item (pass/fail) with frequencies of test
performance (pass/fail). The formula used to obtain the phi values was as
follows:

$$\phi = \sqrt{\frac{\chi^2}{n}} \tag{4}$$

where:

n = number of Ss

$$\chi^2 = \sum^{k} \frac{(f_o - f_e)^2}{f_e}$$

where: f_o = observed frequency
f_e = predicted frequency
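The phi computation can be sketched from the 2 x 2 table of item result (pass/fail) by test result (pass/fail). The code below assumes phi is the square root of chi-square over n, with predicted frequencies taken from the row and column totals; the function name and example data are hypothetical.

```python
import math


def phi_from_chi_square(item_scores, total_scores, cut_score=10):
    """Phi for one item: sqrt(chi-square / n) from the 2 x 2 frequency table."""
    n = len(item_scores)
    # Observed frequencies, keyed by (item result, test result), 1 = pass.
    observed = {(i, t): 0 for i in (0, 1) for t in (0, 1)}
    for x, y in zip(item_scores, total_scores):
        observed[(x, 1 if y >= cut_score else 0)] += 1
    item_totals = {i: observed[(i, 0)] + observed[(i, 1)] for i in (0, 1)}
    test_totals = {t: observed[(0, t)] + observed[(1, t)] for t in (0, 1)}
    chi_square = 0.0
    for (i, t), f_observed in observed.items():
        f_predicted = item_totals[i] * test_totals[t] / n
        if f_predicted == 0:
            raise ValueError("a row or column of the 2 x 2 table is empty")
        chi_square += (f_observed - f_predicted) ** 2 / f_predicted
    return math.sqrt(chi_square / n)


# Hypothetical data for one item: 1 = correct, 0 = incorrect.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [18, 16, 14, 12, 10, 9, 8, 6, 5, 3]

print(round(phi_from_chi_square(item, totals), 2))
```

For these data every predicted frequency is 2.5, chi-square is 3.6, and phi is 0.6. Note that a phi obtained this way is never negative, so the direction of the association has to be read from the table itself.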

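The point-biserial correlation was the final discrimination statistic
used in the study. This statistic was obtained by employing the formula shown
in equation 5:

$$r_{pb} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}} \tag{5}$$

where: X = test item score (0 or 1)
Y = total test score (0 to 20)
N = sample size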
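With X, Y, and N defined as above, the point-biserial is the ordinary product-moment correlation between the 0/1 item score and the total score. The sketch below assumes the raw-score form of that formula; the function name and example data are hypothetical.

```python
import math


def point_biserial(item_scores, total_scores):
    """Raw-score Pearson correlation between a 0/1 item and the total score."""
    n = len(item_scores)
    sum_x = sum(item_scores)
    sum_y = sum(total_scores)
    sum_xy = sum(x * y for x, y in zip(item_scores, total_scores))
    sum_x2 = sum(x * x for x in item_scores)
    sum_y2 = sum(y * y for y in total_scores)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data for one item: 1 = correct, 0 = incorrect.
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [18, 16, 14, 12, 10, 9, 8, 6, 5, 3]

print(round(point_biserial(item, totals), 2))
```

For these ten subjects the numerator is 135 and the denominator is the square root of 53725, so r_pb is about 0.58.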



Two assumptions are most commonly applied to the use of the
point-biserial:

(1) a normal distribution of criterion scores should be present
(2) variables should be measured using interval or ratio scales

Analysis of variance (ANOVA) was chosen to estimate significant
differences between the various discrimination index values obtained in the
item analysis. The two key assumptions regarding the proper use of ANOVA are
(Kachigan, 1982):

(1) The scores in each population are normally distributed
(2) The k population variances are equal (homogeneity of variance)

Upon rejecting the null hypothesis (that the mean values of the item
discrimination indices are equal), the paired t-test was applied to the two
indices producing the largest average value, to determine if there was a
significant difference between the results. The two assumptions for the use
of the t-test are:

(1) The scores are normally distributed
(2) The data are interval in nature
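The two tests can be sketched without a statistics package. The example below applies a one-way ANOVA to the four sets of 20 item indices and a paired t-test to D2 and r_pb, using the bimodal-distribution values as they can be read from Table 2, with phi set equal to D1 as the text reports. The function names are hypothetical, and the data entry is a transcription from a scanned table, so the result is a check rather than a reanalysis.

```python
import math


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        mean_g = sum(g) / len(g)
        ss_between += len(g) * (mean_g - grand_mean) ** 2
        ss_within += sum((x - mean_g) ** 2 for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


def paired_t(a, b):
    """Return (t, df) for a paired t-test on two equal-length lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Bimodal-distribution indices for items 1-20, transcribed from Table 2.
d1_vals = [.56, .45, .48, .44, .49, .55, .53, .48, .62, .46,
           .51, .33, .42, .62, .44, .49, .55, .47, .33, .58]
d2_vals = [.77, .70, .57, .54, .60, .63, .73, .54, .84, .57,
           .67, .56, .73, .73, .57, .60, .63, .57, .57, .73]
phi_vals = list(d1_vals)  # the text reports phi identical to D1 here
rpb_vals = [.59, .57, .51, .48, .52, .55, .59, .47, .68, .48,
            .55, .43, .51, .63, .47, .50, .59, .55, .42, .61]

f_ratio, df_b, df_w = one_way_anova_f([d1_vals, d2_vals, phi_vals, rpb_vals])
t_value, df_t = paired_t(d2_vals, rpb_vals)
print(f"F({df_b},{df_w}) = {f_ratio:.2f}")
print(f"t({df_t}) = {t_value:.2f}")
```

With these transcribed values the sketch gives F(3,76) of about 16.51 and t(19) of about 9.88, in line with the bimodal results reported in the Results section.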

The assumption of normality shared by all tests (with the exception
of phi, which is a "distribution free" statistic) was obviously met in the
normal distribution data base (see Table 1). Just as obvious was that the
bimodal and skewed data bases violated this assumption. This is not a
problem, however, since the central focus of the study was to assess the
impact of violations of this assumption.

All data were measured using an interval rating scale. Therefore,
the assumption of interval data for analysis was held during the simulation as
well.

In terms of the equality of mean standard errors of measurement in 
the upper and lower groups, since the items correct for each S were randomly 
assigned, it was concluded that the upper and lower groups would have equal 
errors of measurement. 

With regard to the ANOVA assumptions, the discrimination index values
analyzed were somewhat normally distributed, with a slight degree of skewness
in the data sets. The skewness coefficients shown in the data are not of
major concern, however, since most authorities agree (see, for example, Games
& Klare, 1967) that ANOVA (as well as the t-test) is a robust statistic with
regard to violations of the assumption of normality. In terms of the equality
of population variances, there do not appear to be significant differences
between the item indices studied; therefore, the second assumption was clearly
satisfied.

The descriptive characteristics of the variables studied during the
ANOVA and t-tests appear as Appendix A.

Results

The results of the analyses for each item appear as Table 2. As
expected, the discrimination indices produced different values for different
score distributions.

insert Table 2 about here

The normal distribution average p value was 0.52. The mean value for
each of the statistics showed clearly that D2 (upper-lower 27%) produced the
largest discrimination values, r_pb the second largest values, and phi and D1
(upper-lower 50%) produced the least discriminating values. Interestingly,
phi and D1 values were identical. ANOVA results suggested that the
differences between the discrimination indices were statistically significant
between the groups, F(3,76) = 10.5947, p < .01. In addition, the t-test applied
to D2 and r_pb showed that the differences between these two indices were
significant, t(19) = 6.92, p < .01.



Insert Table 3 about here

Analysis of the bimodal discrimination indices produced similar
results. The average difficulty (p) was 0.52. In addition, D2 produced the
largest values, followed by r_pb, then phi and D1. Again, the values
obtained using phi and D1 were identical. The results of the ANOVA appear as
Table 4, showing a significant difference between the 4 groups of indices,
F(3,76) = 16.5137, p < .01. The t-test demonstrated that D2 was again
superior to r_pb for the bimodally distributed test scores, t(19) = 9.88,
p < .01.

insert Table 4 about here

The skewed distribution analyses suggested a slightly different
pattern. The mean p value for these data was 0.74, clearly showing that more
Ss got the items correct than in the other two test distributions, an expected
finding for a simulation designed to represent a CRT situation. D2 again
produced the largest values, followed by r_pb. However, in this case, D1
results were larger than those indices obtained using phi. ANOVA results
(Table 5) show that there were significant differences between the groups,
F(3,76) = 6.4117, p < .01; however, there were no significant differences
between D2 and r_pb, as suggested by the t-test, t(19) = 1.04, p = ns.

insert Table 5 about here 

Discussion 

The data strongly suggest that the distributions of scores influence
the values obtained from the various indices. Clearly, military evaluators
should consider the frequency distributions of their test scores when
selecting item discrimination indices.

One of the most interesting findings of the study was that the phi
coefficient and D1 statistics produced identical values when the test data
were bimodally and normally distributed. These data suggest that if evaluators
are faced with analyzing data with these distributional characteristics,
simply calculating the 50% upper-lower index will produce values identical to
the phi (provided the cut score happens to be at the median). The evaluator
can then use a Pearson r table to estimate the significance of the index,
since phi is a special case of r. This strategy can save the evaluator time,
because phi is much more difficult to compute than D1.

Another interesting finding was that in the case of the skewed
distribution, there were no significant differences between the values
produced by D2 and r_pb. These results suggest that either method can be
used in a skewed distribution setting to obtain essentially the same
discrimination values. Therefore, if limited statistical analysis resources
are available, the evaluator can use whichever statistic is most easily
computed.

Based on the results of this study and the review of the literature,
the flow chart appearing as Figure 1 was constructed. Test evaluators can use
this flow chart to select the item discrimination statistic most appropriate
for their own unique testing situation. The chart begins with the most
desirable method for obtaining item instructional sensitivity data. If the
conditions cannot be met for the use of this statistic, then the second most
effective statistic is recommended, and so on. (The method of ranking the
desirability of the statistics was based on the internal and external threats
to validity associated with their use.)

It is important to note that each statistic must be interpreted in
its own unique way. An acceptable value for phi might be totally unacceptable
for r_pb, since each index produces a range of values specific to itself.
For this reason, if quality assurance requirements are placed in the test
development product standards, both the statistic and the general level of
acceptance should be specified. The problems associated with the violations
of the assumptions of normality should also be discussed, outlining which
statistic is preferred under a given set of circumstances. This will
safeguard both the evaluator and test developer from making inappropriate
interpretations of the discrimination statistic values. This recommendation
is supported by Englehart (1965), who suggests that critical values for
accepting an item's discrimination power are a function of the difficulty of
the item.

In terms of the ease of use, costs, and practical benefits of each
statistic, the availability of computer resources is a major determining
factor in the selection of an item discrimination statistic. Test designers
and evaluators who have computer facilities with item analysis programs
available can generally disregard the "difficulty" of using various statistics
since they are automatically calculated by the computer. However, when
dealing with small Ns, where it is not cost-efficient to code and develop a
data base, and finally analyze the data, or where adequate computer
facilities are not available, the ease of computation is very important.
Undoubtedly, D1 is the easiest of item discrimination indices to obtain. The
evaluator must simply rank order the results, divide into upper and lower
groups, and compute the results. This method has the added advantage, in the
case of normal and bimodal distributions, of being a good estimate of phi.
Therefore the significance levels of the indices can be estimated easily.
Next easiest is the upper-lower 27%. However, a large N should be made
available (at least 12 Ss in both the upper and lower groups); otherwise the
simple 50% split should be used. Computation of both phi and r_pb is more
difficult, and in large N situations, should be employed only through the use
of a computer. The statistically unsophisticated evaluator would clearly have
more difficulty using these formulae than the simple upper-lower groups
discrimination index. Table 6 provides a summary matrix of the assumptions,
limitations, and ease of use of the statistics described in this study.

insert Table 6 about here 

Recommendations for Future Research 

Several problems should be investigated in the future to further the
knowledge base concerning the use of classical item analysis in CRT settings.
One interesting question would be to determine the point where skewness begins
to affect the values produced by the discrimination indices. The present
study has demonstrated that a moderately skewed distribution produces
differences between the statistics that are not found in bimodal and normally
distributed test score distributions. A study examining differing degrees of
skewness may be needed to help evaluators and researchers select the statistic
most appropriate for that level of skewness.

This study employed items with very little variance in p values, by
randomly selecting correct responses for each item. Although these results
are typical for CRT environments (in that most students get most items correct,
resulting in a small degree of variance), a study using data with items of
differing item variances may reveal different results. This may be an
important issue to examine in the future because it is sometimes desirable to
use items of differing difficulty values, even in CRT situations.

Finally, the mathematical reasoning behind the equal values for phi
and D1 should be explored, to determine whether the results of this study are
a special case of these two statistics (when median and cut scores fall at the
same value, and data are normally or bimodally distributed) or whether the
mathematical short-cuts derived from the study can be generalized to other
data sets as well.
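One way to approach this question, offered here as a sketch rather than as part of the original analysis, is to write both statistics in terms of the 2 x 2 table of item result by mastery status. The symbols MC, M, NC, and N are those of equation two; n = M + N, p is the item difficulty, and q = 1 - p are introduced for this sketch, and phi is taken in its signed product-moment form.

```latex
\begin{align*}
D1 &= \frac{MC}{M} - \frac{NC}{N}
    = \frac{MC \cdot N - NC \cdot M}{M\,N}, \\[4pt]
\phi &= \frac{MC\,(N - NC) - NC\,(M - MC)}
             {\sqrt{M\,N\,(MC + NC)\,(n - MC - NC)}}
      = \frac{MC \cdot N - NC \cdot M}{n\,\sqrt{M\,N\,p\,q}}, \\[4pt]
\phi &= D1\,\sqrt{\frac{(M/n)\,(N/n)}{p\,q}} .
\end{align*}
```

If the cut score falls at the median, so that M = N = n/2, the last line reduces to phi = D1 / (2 sqrt(pq)), which equals D1 exactly when p = .50 and stays within about 1% of D1 for item difficulties between roughly .45 and .58. Under these assumptions the equality appears to be a special case: when the master and nonmaster groups are unequal in size, or item difficulty is far from .50 (as in the skewed data base), the multiplier departs from 1 and the two statistics should be expected to differ.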




References

Allen, M.J., & Yen, W.M. Introduction to Measurement Theory. Monterey, CA:
Brooks/Cole Publishing Co., 1979.

Berk, R.A. (Ed.) A Guide to Criterion-Referenced Test Construction. (2nd ed.)
Baltimore, MD: Johns Hopkins University Press, 1984.

Brennan, R.L. A generalized upper-lower item-discrimination index. Educational
and Psychological Measurement, 1972, 32: 289-303.

Brennan, R.L., & Stolurow, L.M. An empirical decision process for formative
evaluation. Research Memorandum No. 4. Cambridge, MA: Harvard CAI Lab,
1971.

Cox, R.C., & Vargas, J. A comparison of item selection techniques for norm-
referenced and criterion-referenced tests. Presented at the American
Educational Research Association, San Francisco, 1979.

Cureton, E.E. The upper and lower twenty-seven per cent rule. Psychometrika,
1957, 22: 293-296.

Ebel, R.L. Procedures for the analysis of classroom tests. Educational and
Psychological Measurement, 1964, 24: 85-90.

Ellis, J.A., & Wulfeck, W.H. Handbook for Testing in Navy Schools. San Diego,
CA: Navy Personnel Research and Development Center, 1982.

Englehart, M.D. A comparison of several item discrimination indices. Journal
of Educational Measurement, 1965, 2: 69-74.

Games, P.A., & Klare, G.R. Elementary Statistics. New York: McGraw-Hill, 1967.

Guilford, J.P. Psychometric Methods. (2nd ed.) New York: McGraw-Hill, 1954.

Haladyna, T.M., & Roid, G. The stability of Rasch item and student achievement
estimate for a criterion-referenced test. Presented at the annual meeting
of the National Council on Measurement in Education, San Francisco, 1979.

Haladyna, T.M., & Roid, G. The role of instructional sensitivity in the
empirical review of criterion-referenced test items. Journal of Educational
Measurement, 1981, 18: 39-53.

Johnson, A.P. Notes on a suggested index of item validity: the U-L index.
Journal of Educational Psychology, 1951, 42: 499-504.

Kachigan, S.K. Multivariate Statistical Analysis: A Conceptual Introduction.
New York: Radius Press, 1982.

Kelley, T.L. The selection of upper and lower groups for the validation of
test items. Journal of Educational Psychology, 1939, 30: 17-24.

Popham, W.J. Modern Educational Measurement. Englewood Cliffs, NJ:
Prentice-Hall, 1981.

Roid, G., & Haladyna, T.M. A Technology for Test-Item Writing. New York:
Academic Press, 1982.






TABLE 1: DESCRIPTIVE STATISTICS: TEST SCORE DISTRIBUTIONS

Distribution  Number  Mean     Variance  Std Dev  Std Error  Skewness  Kurtosis
Normal        110     10.5000  18.4174   4.2916   0.4092      0.00     2.3835
Bimodal       110     10.2545  28.9071   5.3765   0.5126      0.02     ?
Skewed        110     14.9000  14.8982   3.8598   0.3680     -0.84     ?

(? = value illegible in the source; skewness is shown to two decimals.)




TABLE 2: ITEM ANALYSIS RESULTS

(? = value illegible or missing in the source)

Normal distribution

Item   p     D1    D2    Phi   r_pb
1      ?     .2?   .40   .2?   ?
2      ?     .30   ?     .30   ?
3      ?     .47   ?     .47   .47
4      ?     .38   ?     .38   ?
5      .4?   .38   ?     .38   .49
6      ?     .29   .30   .29   ?
7      .54   .27   .53   .27   .44
8      .58   .26   .37   .26   .30
9      .47   .47   .67   .47   .53
10     .46   .38   .60   .38   .48
11     .53   .29   .56   .29   .40
12     .55   .55   .86   .55   .62
13     .56   .18   .37   .18   .35
14     .53   .33   .46   .33   .41
15     .58   .33   .66   .33   .48
16     .57   .57   .70   .57   .58
17     .53   .25   .44   .25   .31
18     .50   .42   .46   .42   .39
19     .48   .20   .37   .20   .28
20     .49   .40   .43   .40   .40
Mean   .52   .35   .53   .35   .42
SD     .0?   .11   .14   .1?   .09

Bimodal distribution

Item   p     D1    D2    Phi   r_pb
1      .49   .56   .77   .56   .59
2      .49   .45   .70   .45   .57
3      .55   .48   .57   .48   .51
4      ?     .44   .54   .44   .48
5      .4?   .49   .60   .49   .52
6      ?     .55   .63   .55   .55
7      ?     .53   .73   .53   .59
8      ?     .48   .54   .48   .47
9      .54   .62   .84   .62   .68
10     .51   .46   .57   .46   .48
11     .52   .51   .67   .51   .55
12     .54   .33   .56   .33   .43
13     .55   .42   .73   .42   .51
14     .54   .62   .73   .62   .63
15     .48   .44   .57   .44   .47
16     .49   .49   .60   .49   .50
17     .48   .55   .63   .55   .59
18     .50   .47   .57   .47   .55
19     .50   .33   .57   .33   .42
20     .52   .58   .73   .58   .61
Mean   .52   .49   .64   .49   .54
SD     .03   .08   ?     .08   .07

Skewed distribution

Item   p     D1    D2    Phi
1      .71   .54   .67   .42
2      .67   .42   .47   .31
3      .75   .30   .33   .24
4      .77   .46   .53   .39
5      .79   .48   .40   .42
6      .69   .29   .40   .23
7      .71   .32   .53   .25
8      .72   .33   .46   .26
9      .77   .39   .33   .33
10     .76   .38   .47   .32
11     .76   .67   .44   .56
12     .79   .63   .53   .55
13     .80   .42   .40   .37
14     .70   .53   .73   .41
15     .76   .38   .50   .32
16     .78   .40   .30   .34
17     .75   .30   .40   .24
18     .72   .40   .43   .31
19     .72   .25   .36   .20
20     .75   .37   .50   .30
Mean   .74   .41   .46   .34
SD     .04   .1?   .11   .10

''Db 


.57 


.43 


.28 


.49 


.51 


.40 


.43 


.45 


.41 


.41 


.52 


.62 


.37 


.62 


.44 


.35 


.33 


.45 


.33 


.46 


.44 


.09 
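The five statistics tabled above can be illustrated with a short sketch. The Python below is not the program used in the study. It is a minimal illustration that assumes a 0/1 response matrix, a fixed cut score for the master/nonmaster split, and Kelley's 27% rule for the upper and lower groups. The function name item_indices, its parameters, and the toy data are hypothetical, and the single upper-lower D computed here is not claimed to match either the D1 or the D2 version reported in the table.

```python
import math


def item_indices(responses, cut_score, tail_fraction=0.27):
    """Return p, upper-lower D, phi, and point-biserial r for each item.

    responses: one list of 0/1 item scores per examinee (hypothetical layout).
    cut_score: minimum total score counted as a pass (master).
    tail_fraction: share of examinees in each extreme group (assumed 27%).
    """
    n = len(responses)
    totals = [sum(row) for row in responses]
    order = sorted(range(n), key=lambda i: totals[i])  # lowest total first
    k = max(1, round(n * tail_fraction))
    lower, upper = order[:k], order[-k:]
    mean_total = sum(totals) / n
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in totals) / n)
    passed = [t >= cut_score for t in totals]

    results = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        n_right = sum(item)
        p = n_right / n

        # Upper-lower index: proportion correct in upper minus lower group.
        d = (sum(item[i] for i in upper) - sum(item[i] for i in lower)) / k

        # Phi from the 2x2 table of item right/wrong by pass/fail.
        a = sum(1 for x, ok in zip(item, passed) if x == 1 and ok)
        b = sum(1 for x, ok in zip(item, passed) if x == 0 and ok)
        c = sum(1 for x, ok in zip(item, passed) if x == 1 and not ok)
        e = sum(1 for x, ok in zip(item, passed) if x == 0 and not ok)
        denom = math.sqrt((a + b) * (c + e) * (a + c) * (b + e))
        phi = (a * e - b * c) / denom if denom else float("nan")

        # Point-biserial: item score correlated with the (uncorrected) total.
        if 0 < p < 1 and sd_total > 0:
            mean_right = sum(t for t, x in zip(totals, item) if x == 1) / n_right
            r_pb = (mean_right - mean_total) / sd_total * math.sqrt(p / (1 - p))
        else:
            r_pb = float("nan")

        results.append({"p": p, "D": d, "phi": phi, "r_pb": r_pb})
    return results


# Toy data: four examinees, three items (illustrative only).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
toy_stats = item_indices(toy, cut_score=2, tail_fraction=0.5)
```

In this sketch the total score includes the item being analyzed, so the point-biserial values are slightly inflated for short tests. Ties in total score are broken by input order, which a production version would need to handle explicitly.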






TABLE 3: ANOVA: NORMAL DISTRIBUTION

Source of Variation   DF    SS       MS       F-Stat
Among Groups           3    0.4264   0.1421   10.5947
Within Groups         76    1.0195   0.0134
Total                 79    1.4459

Group Statistics

Group      N    Sum       U-SSQ    Mean     C.V.      S.D.     S.E. (CV)
NormD1     20    6.9400   2.6386   0.3470   31.7361   0.1101   5.5001
NormD2     20   10.5000   5.9084   0.5250   27.4952   0.1443   4.6645
NormPhi    20    6.9400   2.6386   0.3470   31.7361   0.1101   5.5001
NormCorr   20    8.4500   3.7329   0.4225   21.9074   0.0926   3.6263

Note: The Among Groups SS is illegible in the source copy; the value shown is
Total SS minus Within Groups SS.
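The ANOVA entries for the normal distribution can be recomputed from the group statistics alone. The sketch below assumes that the Sum and U-SSQ columns hold each group's raw sum and uncorrected sum of squares; that reading is an inference from the arithmetic, not something the table states. The function name and the dictionary keys are hypothetical.

```python
def anova_from_summaries(groups):
    """One-way ANOVA from per-group (n, sum, uncorrected sum of squares)."""
    n_total = sum(n for n, _, _ in groups)
    grand_sum = sum(s for _, s, _ in groups)
    correction = grand_sum ** 2 / n_total  # correction for the grand mean

    ss_total = sum(q for _, _, q in groups) - correction
    ss_among = sum(s ** 2 / n for n, s, _ in groups) - correction
    ss_within = ss_total - ss_among

    df_among = len(groups) - 1
    df_within = n_total - len(groups)
    ms_among = ss_among / df_among
    ms_within = ss_within / df_within
    return {
        "ss_among": ss_among,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df_among": df_among,
        "df_within": df_within,
        "F": ms_among / ms_within,
    }


# (N, Sum, U-SSQ) for D1, D2, phi, and point-biserial, normal distribution.
normal_groups = [
    (20, 6.9400, 2.6386),
    (20, 10.5000, 5.9084),
    (20, 6.9400, 2.6386),
    (20, 8.4500, 3.7329),
]
normal_anova = anova_from_summaries(normal_groups)
```

Under these assumptions the computed sums of squares and F ratio agree with the tabled values to the precision shown, which also supports reading the illegible Among Groups SS as the difference between the Total and Within Groups entries.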






TABLE 4: ANOVA: BIMODAL DISTRIBUTION

Source of Variation   DF    SS       MS       F-Stat
Among Groups           3    0.3106   0.1035   16.5137
Within Groups         76    0.4765   0.0063
Total                 79    0.7871

Group Statistics

Group       N    Sum       U-SSQ    Mean     C.V.      S.D.     S.E. (CV)
BimodD1     20    9.8000   4.9222   0.4900   16.2323   0.0795   2.6333
BimodD2     20   12.8500   8.4041   0.6425   13.7355   0.0883   2.2124
BimodPhi    20    9.8000   4.9222   0.4900   16.2323   0.0795   2.6333
BimodCorr   20   10.7000   5.8126   0.5350   12.7279   0.0681   2.0448






TABLE 5: ANOVA: NEGATIVELY SKEWED DISTRIBUTION

Source of Variation   DF    SS       MS       F-Stat
Among Groups           3    0.1719   0.0573   5.4173
Within Groups         76    0.8039   0.0106
Total                 79    0.9758

Group Statistics

Group      N    Sum      U-SSQ    Mean     C.V.      S.D.     S.E. (CV)
SkewD1     20   8.2600   3.6488   0.4130   27.0665   0.1118   4.5824
SkewD2     20   9.1800   4.4338   0.4590   23.4531   0.1076   3.9069
SkewPhi    20   6.7700   2.4757   0.3385   29.0762   0.0984   4.9709
SkewCorr   20   8.8700   4.0961   0.4435   20.8367   0.0924   3.4346
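The Mean, S.D., C.V., and S.E. (CV) columns can be checked against the N, Sum, and U-SSQ columns. The sketch below assumes that U-SSQ is the uncorrected sum of squares and that the standard error of the coefficient of variation follows a common large-sample approximation; the paper does not state which formula was used, so this is only a formula that happens to reproduce the tabled figures. The function name cv_with_se is hypothetical.

```python
import math


def cv_with_se(n, total, sum_sq):
    """Mean, sample SD, CV (in percent), and an approximate SE of the CV."""
    mean = total / n
    variance = (sum_sq - total ** 2 / n) / (n - 1)  # unbiased sample variance
    sd = math.sqrt(variance)
    cv = 100.0 * sd / mean
    # Assumed approximation: SE(CV) = CV / sqrt(2n) * sqrt(1 + 2 * (CV/100)^2)
    se_cv = cv / math.sqrt(2 * n) * math.sqrt(1 + 2 * (cv / 100.0) ** 2)
    return mean, sd, cv, se_cv


# Second upper-lower index, skewed distribution: N = 20, Sum = 9.18, U-SSQ = 4.4338.
skew_d2 = cv_with_se(20, 9.1800, 4.4338)
```

With these inputs the function returns values that round to the tabled Mean, S.D., C.V., and S.E. (CV) for that row.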






TABLE 6: S [remainder of title illegible in the source copy]

Index    Conditions and Assumptions        Calculations     Limitations

PPDI     1. pre & post test Ss             "easy by hand"   must be able to
         2. N 12 (both groups)                              pre and post test

PPG      1. pre & post test Ss             "easy by hand"   must be able to
         2. N 12 (both groups)                              pre and post test

BI *     1. master/nonmaster scores        "easy by hand"   must be able to
         2. N 12 (both groups)                              identify masters
                                                            and nonmasters

[?]      1. inst/uninst group sessions     "easy by hand"   must be able to
         2. N 12 (both groups)                              randomly assign Ss
                                                            to both groups

D1       1. norm distribution              "easy by hand"   1. assumptions
         2. = mean std errors                               2. upper group
         3. N 12                                               may not be masters

[D2]     1. norm distribution              "easy by hand"   1. assumptions
         2. = mean std errors                               2. upper group
         3. N 12                                               may not be masters

[rpb]    1. norm distribution              need computer    1. assumptions
         2. interval scale                                  2. upper group
         3. N 12                                               may not be masters

Phi      1. dichotomous variables          need computer    cut score for pass/
                                                            fail must be set
                                                            correctly

* See Berk (1984) for an excellent discussion of the statistical merits of these statistics.

Note: Index labels in brackets are blank in the source copy; [D2] and [rpb]
are inferred from the row order of the other tables, and [?] could not be
identified. The relational symbol between N and 12 is missing in the source
copy.






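Figure 1: DISCRIMINATION INDEX SELECTION DECISION TREE

[Decision tree not recoverable from the source copy. Legible fragments: one
YES branch ends at "(1) ... & Wulfeck, 1982"; another YES branch ends at
"(1) PPDI (Cox & Vargas, 1966)" and "(2) PPG (Brennan & Stolurow, 1971)".]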




DESCRIPTIVE STATISTICS: ITEM ANALYSIS INDICES

Index       N    Mean     Variance  Std Dev  Std Error  Skewness  Kurtosis
NormP       20   0.5235   0.0022    0.0470   0.0105     ?         2.?8?4?
NormD1      20   0.3470   0.0121    0.1101   0.0246     0.46208   ?
NormD2      20   0.5250   0.0208    0.1443   0.0323     0.55717   ?
NormPhi     20   0.3470   0.0121    0.1101   0.0246     0.46208   ?
NormCorr    20   0.4225   0.0086    0.0926   0.0207     ?         2.?159?
BimodP      20   0.5125   0.0007    0.0269   0.0060     ?         ?
BimodD1     20   0.4900   0.0063    0.0795   0.0178     ?         2.8242?
BimodD2     20   0.6425   0.0078    0.0883   0.0197     ?         ?
BimodPhi    20   0.4900   0.0063    0.0795   0.0178     ?         ?
BimodCorr   20   0.5350   0.0046    0.0681   0.0152     0.21241   2.414?3
SkewP       20   0.7435   0.0013    0.0365   0.0082     ?         2.09845
SkewD1      20   0.4130   0.0125    0.1118   0.0250     0.79739   2.99520
SkewD2      20   0.4590   0.0116    0.1076   0.0241     0.88193   3.628?5
SkewPhi     20   0.3385   0.0097    0.0984   0.0220     0.83852   3.14578
SkewCorr    20   0.4435   0.0085    0.0924   0.0207     0.35557   2.59345

Note: ? marks a value or digit that is illegible or missing in the source
copy. Row labels for SkewP, SkewD1, SkewD2, and SkewCorr are illegible or
blank in the source copy and are assigned by matching the means to Tables 2
and 5.






